acid residues, of a molecule with unknown structure to generate an estimate for that structure,
known as the target, onto a similar structure of a molecule known as the template,
which is of the same or a similar structural family. The algorithms are similar to those used
in BLAST.
Protein structures are far more conserved than protein sequences among such homologs,
perhaps for reasons of convergent evolution: similar tertiary structures evolve from different
primary structures, which imparts a selective advantage on the organism (but note that primary
sequences with less than 20% sequence identity often belong to different structural
families). Protein structures are in general more conserved than nucleic acid structures.
Alignment in the homology fits is better in regions where distinct secondary structures are
formed (e.g., α-helices, β-sheets; see Chapter 2) and is correspondingly poorer in random coil
primary structure regions.
8.5.6 STEP DETECTION
An increasingly important computational tool is the automation of the detection of steps
in noisy data. This is especially valuable for data output from experimental single-molecule
biophysics techniques. The effective signal-to-noise ratio is often small and so the distinction
between real signal and noise is often challenging to make and needs to be objectified.
Also, single-molecule events are implicitly stochastic and depend upon the underlying
probability distribution for their occurrence that can often be far from simple. Thus, it is
important to acquire significant volumes of signal data, and therefore, an automated method
to extract real signal events is useful. The transition times between different molecular states
are often short compared to sampling time intervals such that a “signal” is often manifested
as a steplike change as a function of time in some physical output parameter, for example,
nanometer-level steps in rapid unfolding domain events of a protein stretched under force
(see Chapter 6), picoampere level steps in current in the rapid opening and closing of an
ion channel in a cell membrane (see Chapter 5), and rapid steps in brightness due to
photobleaching events of single dye molecules (see the previous text).
One of the simplest and most robust ways to detect steps in a noisy, extended time series
is to apply a running window filter that preserves the sharpness and position of a step
edge. A popular method uses a simple median filter, which runs a window of
n consecutive data points along the time series such that the running output is
the median of the data points included in the window. A common way to deal with
the problem of potentially having 2n fewer data points in the output than in the raw data
is to reflect the first and last n data points to the beginning and end of the time series.
Another method uses the Chung–Kennedy filter, which consists
of two adjacent windows of size n run across the data such that the output switches
between the two windows, taking the mean value from whichever window has the smaller
variance (Figure 8.10). The logic here is that if one window encapsulates a step event, then
the variance in that window is likely to be higher.
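The two filters described above can be sketched in a few lines of Python using NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `median_filter` and `chung_kennedy_filter`, the parameter `half_width`, and the synthetic test trace are hypothetical choices made here. The median window is assumed to be centered on each point with `half_width` points on either side, and the two Chung–Kennedy windows are assumed to flank each point (the n points before it and the n points after it), with ends handled by reflection as described in the text. Only the simple two-window switching form described above is shown; more elaborate variants of the filter exist.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter(x, half_width):
    """Running median over a centered window of 2*half_width + 1 points."""
    x = np.asarray(x, dtype=float)
    # Reflect points at both ends so the output has the same length as the input.
    padded = np.pad(x, half_width, mode="reflect")
    windows = sliding_window_view(padded, 2 * half_width + 1)
    return np.median(windows, axis=1)


def chung_kennedy_filter(x, n):
    """Two-window edge-preserving filter (simplified switching form).

    For each point, compare the n points before it with the n points after it
    and output the mean of whichever window has the smaller sample variance.
    Requires n >= 2 so that the sample variance is defined.
    """
    x = np.asarray(x, dtype=float)
    padded = np.pad(x, n, mode="reflect")
    windows = sliding_window_view(padded, n)      # windows[k] = padded[k : k + n]
    means = windows.mean(axis=1)
    variances = windows.var(axis=1, ddof=1)
    num = x.size
    back = slice(0, num)                          # window ending just before point i
    fwd = slice(n + 1, n + 1 + num)               # window starting just after point i
    use_back = variances[back] <= variances[fwd]  # ties go to the backward window
    return np.where(use_back, means[back], means[fwd])


# Synthetic example (assumed values): three plateaus with Gaussian noise.
rng = np.random.default_rng(0)
true_signal = np.repeat([0.0, 5.0, 2.0], 200)
noisy = true_signal + rng.normal(0.0, 1.0, true_signal.size)

median_out = median_filter(noisy, half_width=10)
ck_out = chung_kennedy_filter(noisy, n=10)

for name, out in [("raw", noisy), ("median", median_out), ("Chung-Kennedy", ck_out)]:
    rms = np.sqrt(np.mean((out - true_signal) ** 2))
    print(f"{name:>14s} RMS error relative to true signal: {rms:.3f}")
```

In this sketch the window size must stay below the typical spacing between steps (200 points in the synthetic trace), since a window that spans more than one step no longer has a low-variance side to switch to.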
Both median and Chung–Kennedy filters converge on the same expected value, though the
sample variance on the expected value of a mean distribution (i.e., the square of the standard
error of the mean) is actually marginally smaller than that of a median distribution; the variance
on the expected value from a sampled median distribution is σ²π/(2n), which compares
with σ²/n for the sample mean (students seeking solace in high-level statistical theory should
see Mood et al., 1974), so the error on the expected median value will be larger by a factor of
~√(π/2) or ~25%. There is therefore an advantage in using the Chung–Kennedy filter. Both
filters require that the size of n is less than the typical interval between step events; otherwise,
the window encapsulates multiple steps and generates nonsensical outputs, so ideally some
prior knowledge of likely stepping rates is required. These edge-preserving filters improve
the signal-to-noise ratio for noisy time series by a factor of ~√n. The decision of whether a
putative step event is real or not can be made on the basis of the probability of the observed
size of the putative step in light of the underlying noise. One way to achieve this is to perform
a Student’s t-test to examine if the mean values of the data, <x>, on either side over a window